Add ADR for large message chunking in MQTT protocol #607
base: main
Conversation
Pull Request Overview
This pull request introduces an Architectural Decision Record (ADR) detailing the proposed implementation of large message chunking in the MQTT protocol. The document outlines the context, decision rationale, protocol flow, benefits, and implementation considerations for handling oversized MQTT messages.
- Introduces a new ADR document.
- Describes the protocol flow for both sending and receiving large messages.
- Highlights implementation considerations including error handling, performance optimization, and security.
I feel a key question we need to answer is: when can the client chunk, and when can it not?
This will depend on whether the receiver is able to understand our chunking protocol.
So we need to enable this only when both sides of the communication pipe use the same mechanism. One example is mRPC. Telemetry can also be applicable, but I suppose telemetry could also be asymmetric?
Co-authored-by: Valerie Avva Lim <[email protected]>
cc206a6 to 4967150
Co-authored-by: Tim Taylor <[email protected]>
…d configuration settings
The receiving client uses the Message Expiry Interval from the first chunk as the timeout period for collecting all remaining chunks of the message.

Edge case:
- Since the Message Expiry Interval is specified in seconds, chunked messages may behave differently than single messages when the expiry interval is very short (e.g., 1 second remaining). For a single large message, the QoS flow would complete even if the expiry interval expires during transmission. However, with chunking, if the remaining expiry interval is too short to receive all chunks, the message reassembly will fail due to timeout.
For other message expiry calculations, we always round partial seconds up, never down - I imagine doing something similar here should maintain acceptable behavior.
On your statement that the QoS flow would (not) complete in the chunking scenario - what exactly do you mean by this? Just that the message might not be received by the end application if all chunks aren't delivered in time, or is there some other ramification of the QoS flow not completing? We would definitely still need to ack all messages in the chunking scenario.
It means that it does not matter if a single message expires during QoS flow execution - the flow will still complete (even if it involves resending the whole message). With chunks, however, at the very end of the expiration interval it is possible that some tail chunks expire before their transfer even starts, which has the effect of the whole (original) message expiring mid-flight and the transfer being canceled.
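To illustrate the round-up suggestion above, here is a minimal sketch (in Python, with hypothetical names; not part of the ADR) of how a receiver could derive its reassembly deadline from the first chunk's Message Expiry Interval, rounding partial seconds up:

```python
import math
import time

def reassembly_deadline(message_expiry_interval_s: float,
                        first_chunk_received_at: float) -> float:
    """Deadline for collecting the remaining chunks, derived from the first
    chunk's Message Expiry Interval. Partial seconds are rounded up, never
    down (assumption: mirroring the other expiry calculations mentioned above)."""
    return first_chunk_received_at + math.ceil(message_expiry_interval_s)

# Hypothetical usage: once the deadline passes, drop the buffered chunks.
deadline = reassembly_deadline(1.4, first_chunk_received_at=time.monotonic())
if time.monotonic() > deadline:
    ...  # cleanup buffers and discard the partial message
```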
Note over Receiver: Message Expiry Interval exceeded
Receiver->>Receiver: Timeout occurred
Receiver->>Receiver: Cleanup buffers
Note over Receiver: Notify application:<br/>ChunkTimeoutError
I think for these error cases, we would just log the error and ignore the received message, similar to any other invalid message that we've received. There's no action the application can take
Note over Receiver: Message Expiry Interval exceeded
Receiver->>Receiver: Timeout occurred
Receiver->>Receiver: Cleanup buffers
Note over Receiver: Notify application:<br/>ChunkTimeoutError
We will need some new RPC errors for communicating chunking issues to the invoker in the command response.
(This is a bad example, as this timeout would occur after the invoker has stopped listening, but for the buffer-size-full case there should be some communication.)
**Chunk size calculation:**

- Maximum chunk size will be derived from the MQTT CONNECT packet's Maximum Packet Size.
- I like having the maximum size based off CONNACK.
- Lower prio - it may be OK to punt on this in v1 - but you should think about whether you also want this to be configurable via a knob, where the size would be min(CONNACK calculation, knob setting) (see the sketch below).
- The scenario is that we could imagine, say, a low-end device that only has a small amount of RAM, where we only want to send smaller chunks even if the MQTT broker allows larger sizes.
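As a rough sketch of the min(CONNACK calculation, knob setting) idea - the function name and the header-overhead allowance below are illustrative assumptions, not part of the ADR:

```python
def effective_chunk_size(connack_max_packet_size: int,
                         configured_max_chunk_size: int | None = None,
                         estimated_header_overhead: int = 1024) -> int:
    """Chunk payload budget: the broker's Maximum Packet Size (from CONNACK)
    minus an allowance for packet headers and user properties, optionally
    capped by a local knob (e.g., for low-RAM devices)."""
    broker_limit = connack_max_packet_size - estimated_header_overhead
    if configured_max_chunk_size is not None:
        return min(broker_limit, configured_max_chunk_size)
    return broker_limit

# Example: broker allows 1 MB packets, but the device caps chunks at 64 KB.
print(effective_chunk_size(1_048_576, configured_max_chunk_size=65_536))  # -> 65536
```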
**The chunking mechanism will**:

- Be enabled/disabled by a configuration setting.
- Use standardized user properties for chunk metadata. The `__chunk` user property will contain a colon-separated string with chunking metadata: `<messageId>:<chunkIndex>:<totalChunks>:<checksum>`. The string will include:
- Speaking of versions - should this also include the version of the chunking protocol?
- How does mRPC command/response handle versioning of the mRPC protocol layer itself (which is different from the customer mRPC clients/servers having the version field)? Can we leverage those concepts?
- As one case, we could imagine(!) a v1 of this protocol requiring the sender to know the size up front, which we then relax in a v2 to support streaming.
- `messageId` - UUID string in the 8-4-4-4-12 format, present for every chunk.
- `chunkIndex` - unsigned 32-bit integer in decimal format, present for every chunk.
- `totalChunks` - unsigned 32-bit integer in decimal format, present only for the first chunk.
- `checksum` - SHA-256 hash in hexadecimal format (64 characters long), present only for the first chunk.
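For illustration, a small Python sketch of building and parsing this `__chunk` value; whether the first chunk uses index 0 and whether the checksum covers the full reassembled payload are assumptions the ADR would need to pin down:

```python
import hashlib
import uuid

def build_chunk_property(message_id: uuid.UUID, chunk_index: int,
                         total_chunks: int | None = None,
                         full_payload: bytes | None = None) -> str:
    """Format <messageId>:<chunkIndex>:<totalChunks>:<checksum>; the last two
    fields are present only on the first chunk."""
    parts = [str(message_id), str(chunk_index)]
    if chunk_index == 0:  # assumption: the first chunk carries totalChunks and checksum
        parts.append(str(total_chunks))
        parts.append(hashlib.sha256(full_payload).hexdigest())  # assumption: hash of full payload
    return ":".join(parts)

def parse_chunk_property(value: str) -> dict:
    """Split the colon-separated metadata back into named fields."""
    fields = value.split(":")
    meta = {"messageId": fields[0], "chunkIndex": int(fields[1])}
    if len(fields) == 4:  # first chunk
        meta["totalChunks"] = int(fields[2])
        meta["checksum"] = fields[3]
    return meta
```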
- Does anywhere else in mRPC use a checksum?
  - If yes, then it makes sense to keep it here, I guess.
  - If no, do we need it here? We rely on TCP to do checksum checks for us. And this checksum would still only check the payload; the headers could (in theory) get corrupted anyway.
I agree; an integrity check that travels with the data it protects can be spoofed by an attacker who can modify both.
**Configuration settings:**

- Enable/Disable

### Implementation Considerations
Outside the scope of this ADR, and potentially post-2510 even, but it's on my mind so recording it here.
- We need to think about who the customers of this can/should be. If the SDKs do most of the work, it should be an easy discussion. But on my mind:
  - Schema Registry team - maybe. Even on a tiny configuration size, the max message is still 4MB, which I think should cover the vast majority of schemas. Though maybe not - OPC UA schemas get big.
  - Tinykube / WASM - 100%! These WASM modules can be really large - like dozens of MB. And since both the Tinykube server and client are owned by Microsoft, it should be really easy to add this.
  - Dataflows - ??. I don't actually know, since I don't have a sense of how large the messages are.
  - OPC UA / connectors - ??. It depends on what we do with Dataflows, as Dataflows is one of their main customers.
```mermaid
sequenceDiagram
    participant Sender as Sending Client
```
- Retry on the sender?
  - It looks like the sending client has no way of knowing whether its upload was actually processed by the receiver.
  - This is the same as any MQTT PUBLISH / telemetry, of course - getting a PUBACK only tells the publisher so much.
- The question is: is that OK for this case? Or do we want to let clients be more robust?
  - So some sort of mRPC-type response-topic construct - maybe sent in the 1st message - where the caller can indicate success|failure?
- We should understand the scenarios, though - we may indeed not need this.
  - For Tinykube, a client will be initiating "please, TK server, download this giant WASM module for me" and then the TK client will listen for the chunked response.
  - So if the chunked response doesn't come in time, the TK client would just retry the "give me the WASM module" request rather than relying on the chunking layer to retry.
- So something to think about, and if we don't think we need it, I think it's worth calling out in the design that we don't need it and why.
Given that the QoS of the original message will be applied to all chunks, wouldn't that give us the needed control over delivery guarantees?
There's always the possibility of a timeout - I think MQ sets messages with 24-hour timeouts by default, though I may be misremembering that.
Or we could imagine a client saying "timeout = 2 minutes" for, say, the entire operation, since if this fails they may want to retry or at least signal an error to the caller/user in a more timely manner. For that, I think you need a message to a response topic on the initiator.
I'm a bit confused after our last meeting (now that chunking is unconditional and automatic). Say I know my message is not going to be chunked - then QoS gives me all I need. On the other hand, if I know my message will be chunked, RPC with an underlying chunking MQTT client gives the desired level of delivery control on the sender side (probably with some extended information about the specific chunking failure in the InnerException):
- RPC calls mqttClient.PublishAsync(completeMessage)
- Chunking layer splits into chunks and sends them
- Something goes wrong with chunking (partial failure)
- RPC only gets the final result: success or failure of the entire message
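A minimal sketch of that sender-side flow (in Python; the `publish` callable is a hypothetical stand-in for the real MQTT client's publish method, and the metadata string is abbreviated):

```python
import uuid

def split_into_chunks(payload: bytes, max_chunk_size: int) -> list[bytes]:
    """Cut a large payload into at most max_chunk_size-byte pieces."""
    return [payload[i:i + max_chunk_size]
            for i in range(0, len(payload), max_chunk_size)]

def publish_chunked(publish, payload: bytes, max_chunk_size: int) -> None:
    """Sender-side chunking layer: split, tag each chunk with the `__chunk`
    user property, and publish in order. Any exception from `publish`
    propagates, so the caller (e.g., the RPC layer) sees a single overall
    success or failure for the entire message."""
    message_id = uuid.uuid4()
    chunks = split_into_chunks(payload, max_chunk_size)
    for index, chunk in enumerate(chunks):
        metadata = f"{message_id}:{index}"  # first chunk would also carry totalChunks and checksum
        publish(chunk, user_properties={"__chunk": metadata})
```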
If I understand the question correctly, the answer is that the receiving side's chunking layer would time out waiting for the final chunk of the RPC call and would not notify the user that any RPC was invoked. The RPC layer would only "count" an RPC call that has received all chunks successfully.
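A sketch of that receive-side behavior, assuming a simple in-memory buffer keyed by messageId (all names are hypothetical): incomplete messages are dropped on timeout, and only fully reassembled payloads are handed to the RPC layer.

```python
import time

class ChunkReassembler:
    """Collects chunks per messageId; returns the complete payload only once
    every chunk has arrived before the deadline, otherwise drops the partial
    message without surfacing anything to the application."""

    def __init__(self) -> None:
        self._buffers: dict[str, dict] = {}

    def add_chunk(self, message_id: str, index: int, data: bytes,
                  total: int | None = None,
                  deadline: float | None = None) -> bytes | None:
        entry = self._buffers.setdefault(
            message_id, {"chunks": {}, "total": None, "deadline": None})
        if total is not None:
            entry["total"] = total          # known from the first chunk
        if deadline is not None:
            entry["deadline"] = deadline    # derived from the Message Expiry Interval
        entry["chunks"][index] = data

        if entry["deadline"] is not None and time.monotonic() > entry["deadline"]:
            del self._buffers[message_id]   # timed out: drop silently, no RPC is "counted"
            return None
        if entry["total"] is not None and len(entry["chunks"]) == entry["total"]:
            del self._buffers[message_id]   # complete: reassemble in order
            return b"".join(entry["chunks"][i] for i in sorted(entry["chunks"]))
        return None
```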
## Context

The MQTT protocol has inherent message size limitations imposed by brokers and network constraints. Azure IoT Operations scenarios often require transmitting payloads that exceed these limits (e.g., firmware updates, large telemetry batches, complex configurations). Without a standardized chunking mechanism, applications must implement their own fragmentation strategies, leading to inconsistent implementations and interoperability issues.
You mean the ~256 MB message size limit?
ADR for Large Message Chunking
Transcript March 24, 2025